Red wine exploratory data analysis by Siân Nadin

About this dataset

This project aims to explore the factors that contribute to the quality of a wine. The dataset used for this project is public available for research. The details are described in [Cortez et al., 2009].

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

Description of attributes:

The dataset consists of 12 variables, with 1599 observations. The attributes are as follows:

1 - fixed acidity: most acids involved with wine or fixed or nonvolatile (do not evaporate readily)

2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste

3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

4 - residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet

5 - chlorides: the amount of salt in the wine

6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

7 - total sulfur dioxide: amount of free and bound forms of S02; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

8 - density: the density of wine is close to that of water depending on the percent alcohol and sugar content

9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (S02) levels, wich acts as an antimicrobial and antioxidant

11 - alcohol: the percent alcohol content of the wine

Output variable (based on sensory data): 12 - quality (score between 0 and 10)

We can take a look at the summary of these variables in order to get an initial overview of what the values are like:

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Univariate Plots Section

First I’ll take a quick look at the distribution of quality in the wine data set with the mean (in blue) and median (in black) plotted on the graph for reference.

summary (red_wine$quality)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

Most of the wines in the dataset are average quality wines with very few low and high quality wines included in this dataset. Due to the fact this data is primarily about average quality wines it might be difficult to build a predictive model that will work across varying types of wine. We might get a better understanding of the data set by plotting each of the variables to get a better idea of the spread of differences for the properties included in this dataset.

Observations from data:

Both pH and density plots have normal distributions. It is likely that these qualities might be regulated in such a way that they must fall within a particular range. Fixed acidity looks like a slightly skewed normal distribution while volatile acidity looks to be slightly bimodal. Since these two acidity values will have a large impact on the pH values I wonder if the sum of these two values would look closer to a normal distribution? We can create total acidity variable which = fixed + volatile acidity and see if it has a normal distribution.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.270   7.827   8.720   9.118  10.070  17.045

This plot isn’t quite the same shape as the pH curve so the volatile acidity, fixed acidity and citric acid levels aren’t the only important factors that contribute to the pH of a wine.

Citric acid levels seem to vary quite a bit. This may be due to different amounts being added to produce different flavours. There is quite a high proportion of wines with no citric acid at all. Whereas things like pH must be specifically within a certain range to produce a good wine there’s more flexibility with levels of citric acid since it is used to acheive a desired flavour.

The distribution of residual sugar is a positively skewed long tail distribution with high peaks around a value of 2g/L and many outliers present at the higher ranges. Chlorides display a similar long tail distribution but at much lower concentrations. This suggests that both residual sugar and chlorides are kept within a fairly tight range with a small amount of extreme outliers.

Free sulphur dioxide, sulphates and total sulfur dioxide follow the same positively skewed patterns with some outliers, although the outliers aren’t as extreme as those seen in the residual sugar and chlrode plots.

Alcohol also follows a skewed distribution but to a lesser extent than the other skewed plots.


Univariate Analysis

What is the structure of your dataset?

The dataset consists of 12 variables, with 1599 observations. Most of the variables have long-tail distributions. However, pH and density have normal distributions.

What is/are the main feature(s) of interest in your dataset?

I’m interseted in what factors impact the quality of wine. The quality is scored between 0-10, but we only have observations with values between 3 and 8. The majority of the wines listed have an average quality of roughly 5.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

I expect citric acid, pH, residual sugar, and volatile acidity will largely contribute to the quality of a wine as these factors will likely have the greatest impact on taste.

Did you create any new variables from existing variables in the dataset?

Yes, I created total_acidity = volatile acidity + citric acid + fixed acidity. I was curious to see if the pH was close to a sum of these three parameters but the plot showed that total acidity was slightly different to pH so there are other factors affecting pH.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

The plot of citric acid seemed a bit strange but the wide amount of variance is likely due to there not being strict constarints on how much or how little citric acid is used in flavouring a wine. I was surprised at how few of the plots had normal distributions. I assumed that there would be a particular recipe that wines wouldn’t deviate too much from but it seems that the elements that go in to making wine can vary quite a lot.


Bivariate Plots Section

In order to get an overview of which variable relationships might be interesting let’s plot their corrleations.

Observations from correlogram:

  1. Wine quality: Wine quality seems to be most strongly (positively) correlated with alcohol percentage, with sulphates and citric acid coming in second. The importance of alcohol percentage and quality was unexpected as I didn’t think this would be a big factorin defining qulity. Since citric acid is used to add a “freshness” to wine it’s not surprising that it has had an impact on wine quality. Sulphates are used to stop oxidation of the wine which will also contribute to it’s quality. On the other hand volatile acidity has the strongest negative correlation with wine quality. This is to be expected as high levels of volatile acidity can lead to a vinegar like taste.

  2. pH: pH is negatively correlated with fixed acidity and citric acid. Since a higher acid concentration will lead to a lower pH, this negative corrleation makes sense. However, it is positively correlated with the volatile acidity, which is slightly confusing. Let’s see if we are seeing a Simpson’s Paradox at play. The correlation chart would suggest that volatile acidity seems to increase pH but the plot below shows that Simpson’s paradox is responsible for the trend reversal. This is due to ommited variable bias which changes the overall coefficient.

  1. As expected, free sulfur dioxide and total sulfur dioxide are highly correlated. Surprisingly there is a low correlation between sulphates and total sulfur dioxide which suggests that the added sulphates don’t largely contribute towards the sulfur dioxide gas levels.

Quality vs. alcohol

In order to understand the correlatition between alcohol and quality we can first investigate what is the average alcohol level for each quality score:

##         3         4         5         6         7         8 
##  9.955000 10.265094  9.899706 10.629519 11.465913 12.094444

We can view the spread of percentage alcohol accross the different quality scores graphically using box plots:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
## [1] 0.4761663

The plot shows that the higher the wine quality, the higher the mean alcohol content. The lower to average quality wines (rated 3-5) have a similar mean alcohol percentage. After you surpass the lower quality wines the mean alcohol conent steadily increases. We also see a great number of outliers here which means that alcohol alone is not a difinitve factor of whether or not a wine will be good quality.

Let’s create a scatter plot and include a simple linear model in order to get a visual representation of the trend of how alcohol affects quality.

We can clearly see that there is a trend showing that an increase in alcohol usually corresponds with a higher quality rating. In order to get a better understanding we can plot the count of how many wines with a given quality rating have a certain percentage of alcohol.

As stated earlier the amount of data collected for each of the quality ratings is not equal which makes the distribution of higher and lower grades difficult to see. In order to get a better comparison of how the amount of alcohol present varies by quality we should normalise the data:

In order to make the data we are looking at easier to comprehend we can group the qualities into three categories: Low(3-4), Medium(5-6) and High(7-8).

Quality vs. sulphates

Let’s see how sulphate levels vary acrooss qualities

##         3         4         5         6         7         8 
## 0.5700000 0.5964151 0.6209692 0.6753292 0.7412563 0.7677778

A visual reprsentaion of this:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000
## [1] 0.2513971

The trend in this plot shows that better wines tend to have more sulphates. The IQR is quite small across all the of the varying qualities, which would suggest that there is a narrow range of acceptance for sulphate levels in wines. There are quite a lot of outliers in the average quality wines.

A quick look at the plot of the count of wines with varying sulphate levels confirms that higher quality wines tend to have more sulphates

Since the sulphates data is skewed a log scaled graph might be more appropriate for this variable:

Quality vs. Volatile acidity

Now let’s look at a negative correlation for quality: volatile acidity.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800
## [1] -0.3905578

Again, we can see what the mean value is across the different quality ranges:

##         3         4         5         6         7         8 
## 0.8845000 0.6939623 0.5770411 0.4974843 0.4039196 0.4233333

A visual reprsentation of the mean across the different quality ratings:

Given the correlation, it was not too surprising to see the trend here. Average to higher quality wines have a lower mean volatile acidity compared to that of lower quality wines. Additionally, average to higher quality wines have a lot of outliers which suggests plenty of potentially vinegary tasting wines can still be considered high quality. The fact that there are plenty of high quality wines with a high volatile acidity would suggest that there are other important factors that come in to play when assigning quality. We also see that the interquartile range decreases in size as the quality of wine improves. This would suggest that there is an optimal range for the volatile acidity that the higher quality wines tend to fall within.

A plot of the wines grouped by the three quality categories further emphasises that the majority of high quality wines have a low volatile acidity.


Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

Based on the results of the correlogram I expect that volatile acidity, citric acid and alcohol could be used to predict what the quality of a wine will be.
The higher quality wines have means and medians for these variables that are distinctively seperate from the lower quality wines.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

I was surprised to see that pH and volatile acidity are positively correleated, since a higher pH value would result from an increase in acidity. However, this proved to be due to Simpsons paradox so there are other lurking variables that would have influenced this outcome.

What was the strongest relationship you found?

Alcohol had the strongest correlation to quality. It seems that the higher quality wines typically have a higher percentage of alcohol.


Multivariate Plots Section

Since volatile acidity and alcohol both play a large part in determining a wines quality we can plot these three factors together:

The plot shows that the lower to average quality wines are more clustered towards the left side of the graph whereas the higher quality wines appear more towards the right - at the higher alcohol percentage. In addition to this the majority of high quality wines appear below 0.6 volatile acidity whereas the lower quality wines tend to mostly fall within the range of 0.4-0.8. Therefore, this graph would suggest that one of the key factors to producing a high quality wine is to keep the volatile acidity low and the alcohol levels towards the higher end of the scale. Higher quality wine have more alcohol and lower volatile acidity.

Similarly, we can compare how alcohol and sulphates vary across quality.

The graph suggests that wines with higher alcohol content will be considered a higher quality if they also have higher level of sulphates.

Let’s compare all three of these important variables and see how they perform when seperated out into quality grades.

From this plot we can see that the high quality wines have a higher amount of both alcohol and sulphates but a lower volatile acidity than the average and low quality wines.

Linear model

Given the information gleamed from the correlogram and the subsequent plots it is plausible to believe that a good way to predict the quality of a iwne is by taking in to account its volatile acidity, alcohol content, level of sulphates, citric acid and pH. Therefore we will attempt to build a linear model with these factors:

## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = red_wine)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = red_wine)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + sulphates, 
##     data = red_wine)
## m4: lm(formula = quality ~ alcohol + volatile.acidity + sulphates + 
##     citric.acid, data = red_wine)
## m5: lm(formula = quality ~ alcohol + volatile.acidity + sulphates + 
##     citric.acid + pH, data = red_wine)
## 
## ==========================================================================================
##                          m1            m2            m3            m4            m5       
## ------------------------------------------------------------------------------------------
##   (Intercept)           1.875***      3.095***      2.611***      2.646***      4.112***  
##                        (0.175)       (0.184)       (0.196)       (0.201)       (0.459)    
##   alcohol               0.361***      0.314***      0.309***      0.309***      0.327***  
##                        (0.017)       (0.016)       (0.016)       (0.016)       (0.017)    
##   volatile.acidity                   -1.384***     -1.221***     -1.265***     -1.284***  
##                                      (0.095)       (0.097)       (0.113)       (0.112)    
##   sulphates                                         0.679***      0.696***      0.673***  
##                                                    (0.101)       (0.103)       (0.103)    
##   citric.acid                                                    -0.079        -0.297*    
##                                                                  (0.104)       (0.120)    
##   pH                                                                           -0.475***  
##                                                                                (0.134)    
## ------------------------------------------------------------------------------------------
##   R-squared             0.227         0.317         0.336         0.336         0.341     
##   adj. R-squared        0.226         0.316         0.335         0.334         0.339     
##   sigma                 0.710         0.668         0.659         0.659         0.656     
##   F                   468.267       370.379       268.912       201.777       165.112     
##   p                     0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood    -1721.057     -1621.814     -1599.384     -1599.093     -1592.801     
##   Deviance            805.870       711.796       692.105       691.852       686.429     
##   AIC                3448.114      3251.628      3208.768      3210.186      3199.602     
##   BIC                3464.245      3273.136      3235.654      3242.448      3237.242     
##   N                  1599          1599          1599          1599          1599         
## ==========================================================================================

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation.

The multivariate analysis revealed that high quality wines tend to have a higher quantities of alcohol and sulphates but a lower volatile acidity.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

Yes, I created a linear model. However, due to the fact that there is not a lot of data on the poor and high quality wines in this dataset the equation will not have a high confidence level for predicting quality. ——

Final Plots and Summary

Plot One

Description One

It is important to note the distribution of wine qualities obtained in this dataset. The majority of wines have a quality between of 5 or 6 A very big majority of wines (roughly 82%) have a quality five or six. There are very few low and high quality wines and absolutel none with a quality of 1,2,8 or 9. Due to the limited amount of data we have on wines that are not of an average quality it will be diffivult to determine what produces a high quality wine using this data alone.

Plot Two

Description Two

Three factors that were highly correlated with quality were alcohol, phosphate levels and volatile acid. The high quality wines were shown to contain a low amount of volatile acidity and higher amounts of alcohol and sulphates. The lower quality wines displayed high amount of volatile acidity and low amounts of alcohol and sulphates. Therefore thses attributes can be used to make a reasonable prediction of how a wines quality will be scored.

Plot Three

Description Three

The plots show that low quality wines tend to have a higher volatile acidity and lower sulphate and alcohol levels, while high quality wines display the opposite. The average quality wines suffer from quite a lot of overplotting which is due to the bulk of the data representing average quality wines. It is interesting to note that there is plenty of overlap for these three features for all of the different wine qualities. Since there is no distinctive barrier between the differing qualities other fcators must be taken into consideration when assesing a wines quality. One distinctive pattern that can be seen is that the majority of the high quality wines have much lower levels of volatile acidity than the low quality wines. Despite the fact that high quality wines can still be found to have a lower alcohol percentage, which is commonly found in lower quality wines, it seems unlikely that you will find a high quality wine that will have a high volatile acidity. The maximum amount of volatile acidity found in high quality wines is 0.915g/dm^3 whereas in low quality wines it can be as high as 1.58g/dm^3. However, these values are outliers as the mean volatil acidity for low and high quality wine is 0.72 and 0.4 respectively.

##    Low Medium   High 
##  1.580  1.330  0.915
##       Low    Medium      High 
## 0.7242063 0.5385595 0.4055300

Reflection

This analysis showed that although factors such as alcohol content, volatile acidity and sulphate levels can be good indicators of whether or not a wine will be considered a good quaity, there are other important factors at play. Some of the other factors that may go into determining quality may be things such as smells and flavours rather than just chemical properties. In addition to this many other things that aren’t necessarily related to the wine composition could affect how popular it is. Attributes such as price, branding, availability and manufacturing process can all greatly affect how popular a wine will be. In conclusion, a much more reliable study of wine quality could be conducted if the data set included more data on wines that fall on both the lower and higher end of quality as well as including several other additional attributes that would improve a prediction on wine quality a lot more.